CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge

Li, Muqing, Li, Ning, Yuan, Xin, Xu, Wenchao, Chen, Quan, Guo, Song, Zhang, Haijun

arXiv.org Artificial Intelligence

The proliferation of large language models (LLMs) has driven the adoption of Mixture-of-Experts (MoE) architectures as a promising solution to scale model capacity while controlling computational costs. However, deploying MoE models in resource-constrained mobile edge computing environments presents significant challenges due to their large memory footprint and dynamic expert activation patterns. To address these challenges, we propose CoMoE, a novel dynamic resource-aware collaborative optimization framework that jointly optimizes expert aggregation granularity and offloading strategies based on real-time device resource states, network conditions, and input characteristics in mobile edge environments. In CoMoE, we first systematically analyze existing expert aggregation techniques, including expert parameter merging, knowledge distillation, and parameter sharing decomposition, identifying their limitations in dynamic mobile environments. We then investigate expert offloading strategies encompassing expert prediction and prefetching, expert caching and scheduling, and multi-tier storage architectures, revealing the interdependencies between routing decisions and offloading performance. CoMoE incorporates adaptive scheduling mechanisms that respond to user mobility and varying network conditions, enabling efficient MoE deployment across heterogeneous edge devices. Extensive experiments on real mobile edge testbeds demonstrate that CoMoE achieves approximately a 70% reduction in memory usage compared to baseline methods and 10.5% lower inference latency than existing expert offloading techniques, while maintaining stable model performance. For large-scale MoE models (e.g., the 7.4B-parameter Switch-Base-128), CoMoE reduces memory requirements from 15.6GB to 4.7GB, enabling deployment on resource-constrained mobile edge devices that previously could only support much smaller models.
With the rapid advancement of artificial intelligence technology, Large Language Models (LLMs) have demonstrated unprecedented capabilities in natural language processing, computer vision, and other domains. However, as model scales continue to expand, computational efficiency and memory constraints have become critical challenges in practical model deployment. The Mixture of Experts (MoE) architecture emerges as a promising solution that effectively scales the model capacity while controlling computational costs through sparse activation mechanisms.
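The sparse activation the abstract refers to means each token activates only a few experts, and an edge deployment must then decide which of those experts are already resident in device memory. A minimal sketch of that interaction, with all names and the slot-budget policy being illustrative assumptions rather than CoMoE's actual algorithm:

```python
def route_tokens(gate_scores, k=2):
    """Top-k sparse activation: pick the k experts with the highest gating scores."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    return ranked[:k]

def plan_offloading(needed_experts, cached_experts, memory_slots):
    """Split the routed experts into: run locally (already cached), fetch into
    free memory slots, or offload elsewhere when no slot remains."""
    local = [e for e in needed_experts if e in cached_experts]
    missing = [e for e in needed_experts if e not in cached_experts]
    free = memory_slots - len(cached_experts)
    fetch = missing[:max(free, 0)]
    offload = missing[len(fetch):]
    return local, fetch, offload
```

The point of the sketch is the coupling the abstract highlights: the routing decision (`route_tokens`) directly determines the offloading workload (`plan_offloading`), which is why the two are optimized jointly.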


Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices

Ye, Shengyuan, Ouyang, Bei, Zeng, Liekang, Qian, Tianyi, Chu, Xiaowen, Tang, Jian, Chen, Xu

arXiv.org Artificial Intelligence

Generative large language models (LLMs) have garnered significant attention due to their exceptional capabilities in various AI tasks. Traditionally deployed in cloud datacenters, LLMs are now increasingly moving towards more accessible edge platforms to protect sensitive user data and ensure privacy preservation. The limited computational resources of individual edge devices, however, can result in excessively prolonged inference latency and overwhelming memory usage. While existing research has explored collaborative edge computing to break the resource wall of individual devices, these solutions still suffer from massive communication overhead and under-utilization of edge resources. Furthermore, they focus exclusively on optimizing the prefill phase, neglecting the crucial autoregressive decoding phase for generative LLMs. To address this, we propose Jupiter, a fast, scalable, and resource-efficient collaborative edge AI system for generative LLM inference. Jupiter introduces a flexible pipelined architecture as a design principle and differentiates its system design according to the distinct characteristics of the prefill and decoding phases. For the prefill phase, Jupiter proposes a novel intra-sequence pipeline parallelism and develops a meticulous parallelism planning strategy to maximize resource efficiency; for the decoding phase, Jupiter devises an effective outline-based pipeline parallel decoding mechanism combined with speculative decoding, which further magnifies inference acceleration. Extensive evaluation based on a realistic implementation demonstrates that Jupiter remarkably outperforms state-of-the-art approaches under various edge environment setups, achieving up to 26.1x end-to-end latency reduction while rendering on-par generation quality.
INTRODUCTION
The emergence of generative large language models (LLMs) has attracted widespread attention from both industry and academia owing to their exceptional capabilities in a wide range of artificial intelligence (AI) tasks. These models, widely deployed in cloud datacenters equipped with powerful server-grade GPUs, have driven a growing range of intelligent edge applications such as ChatBot [1] and smart-home AI agents [2].
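The speculative decoding the Jupiter abstract combines with pipeline parallelism follows a generic draft-then-verify pattern: a cheap draft model proposes several tokens, and the target model accepts the longest agreeing prefix. A toy sketch of just that acceptance rule (not Jupiter's outline-based mechanism), where `verify_token` stands in for one target-model step:

```python
def speculative_accept(draft_tokens, verify_token):
    """Accept the longest prefix of draft tokens the target model agrees with.

    `verify_token(prefix)` is a stand-in for the target model: given the
    accepted prefix so far, it returns the token the target would emit next.
    On the first disagreement, the target's own token replaces the draft's.
    """
    accepted = []
    for t in draft_tokens:
        if verify_token(accepted) == t:
            accepted.append(t)        # draft token confirmed "for free"
        else:
            accepted.append(verify_token(accepted))  # target's correction
            break
    return accepted
```

Because agreeing draft tokens are confirmed in one verification pass rather than one autoregressive step each, the scheme accelerates decoding without changing the target model's output distribution in the exact-match variant shown here.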


CHESTNUT: A QoS Dataset for Mobile Edge Environments

Zou, Guobing, Zhao, Fei, Hu, Shengxiang

arXiv.org Artificial Intelligence

Quality of Service (QoS) is an important metric for measuring the performance of network services. Nowadays, it is widely used in mobile edge environments to evaluate the quality of service when mobile devices request services from edge servers. QoS usually involves multiple dimensions, such as bandwidth, latency, jitter, and packet loss rate. However, most existing QoS datasets, such as the common WS-Dream dataset, focus mainly on static QoS metrics of network services and ignore dynamic attributes such as time and geographic location. In other words, they do not record the mobile device's location at the time of the service request or the chronological order in which requests were made. Yet these dynamic attributes are crucial for understanding and predicting the actual performance of network services, as QoS typically fluctuates with time and geographic location. To this end, we propose a novel dataset that accurately records temporal and geographic location information on quality of service during the collection process, aiming to provide more accurate and reliable data to support future QoS prediction in mobile edge environments.
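The abstract's core point is that each QoS observation must carry a timestamp and a device location alongside the usual metrics. A minimal record sketch; the field names are illustrative assumptions, not the CHESTNUT dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class QoSRecord:
    """One QoS observation, annotated with when and where it was collected."""
    user_id: str
    service_id: str
    timestamp: float      # seconds since epoch, enables chronological ordering
    latitude: float       # device location at request time
    longitude: float
    bandwidth_mbps: float
    latency_ms: float
    jitter_ms: float
    loss_rate: float      # packet loss fraction in [0, 1]

def order_by_time(records):
    """Sort observations chronologically, as temporal QoS prediction requires."""
    return sorted(records, key=lambda r: r.timestamp)
```

With time and location attached per record, a predictor can model how the same service's QoS drifts as a user moves, which purely static datasets like WS-Dream cannot support.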


Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices

Ye, Shengyuan, Zeng, Liekang, Chu, Xiaowen, Xing, Guoliang, Chen, Xu

arXiv.org Artificial Intelligence

On-device Deep Neural Network (DNN) training has been recognized as crucial for privacy-preserving machine learning at the edge. However, the intensive training workload and limited onboard computing resources pose significant challenges to the availability and efficiency of model training. While existing works address these challenges through native resource management optimization, we instead leverage our observation that edge environments usually comprise a rich set of accompanying trusted edge devices with idle resources beyond a single terminal. We propose Asteroid, a distributed edge training system that breaks the resource walls across heterogeneous edge devices for efficient model training acceleration. Asteroid adopts hybrid pipeline parallelism to orchestrate distributed training, along with judicious parallelism planning to maximize throughput under given resource constraints. Furthermore, a fault-tolerant yet lightweight pipeline replay mechanism is developed to tame device-level dynamics for training robustness and performance stability. We implement Asteroid on heterogeneous edge devices with both vision and language models; our evaluations demonstrate up to 12.2x faster training than conventional parallelism methods and 2.1x faster training than state-of-the-art hybrid parallelism methods. Furthermore, Asteroid can recover the training pipeline 14x faster than baseline methods while preserving comparable throughput despite unexpected device exits and failures.
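Parallelism planning over heterogeneous devices amounts to partitioning the model's layers into contiguous pipeline stages so that no device becomes the bottleneck. A greedy sketch of that idea under stated assumptions (per-layer compute costs and per-device speeds are known); this is a generic heuristic, not Asteroid's actual planner:

```python
def plan_stages(layer_costs, device_speeds):
    """Partition contiguous layers into pipeline stages across heterogeneous devices.

    Greedy heuristic: each device receives a contiguous slice of layers whose
    total work approaches its speed-proportional share of the model, so that
    per-stage compute times (work / speed) stay roughly balanced.
    """
    total_work = sum(layer_costs)
    total_speed = sum(device_speeds)
    target = total_work / total_speed  # ideal work per unit of device speed
    stages, i = [], 0
    for d, speed in enumerate(device_speeds):
        stage = []
        budget = target * speed
        while i < len(layer_costs) and (not stage or sum(stage) + layer_costs[i] <= budget):
            stage.append(layer_costs[i])
            i += 1
        if d == len(device_speeds) - 1:  # last device absorbs any remainder
            stage += layer_costs[i:]
            i = len(layer_costs)
        stages.append(stage)
    return stages
```

Balancing stage times matters because pipeline throughput is limited by the slowest stage; a real planner would additionally account for inter-device bandwidth and memory limits.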


RED-CT: A Systems Design Methodology for Using LLM-labeled Data to Train and Deploy Edge Classifiers for Computational Social Science

Farr, David, Manzonelli, Nico, Cruickshank, Iain, West, Jevin

arXiv.org Artificial Intelligence

Large language models (LLMs) have enhanced our ability to rapidly analyze and classify unstructured natural language data. However, concerns regarding cost, network limitations, and security constraints have posed challenges for their integration into work processes. In this study, we adopt a systems design approach to employing LLMs as imperfect data annotators for downstream supervised learning tasks, introducing novel system intervention measures aimed at improving classification performance. Our methodology outperforms LLM-generated labels in seven of eight tests, demonstrating an effective strategy for incorporating LLMs into the design and deployment of specialized, supervised learning models present in many industry use cases.
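One common form of the system intervention the abstract describes is confidence-based routing: trust the LLM's label when it is confident, and escalate uncertain items. A hedged sketch of that pattern; the function names, the `(label, confidence)` interface, and the threshold are illustrative assumptions, not RED-CT's specific measures:

```python
def build_training_set(items, llm_label, human_label, threshold=0.8):
    """Assemble labels for a downstream classifier from an imperfect LLM annotator.

    `llm_label(text)` returns a (label, confidence) pair; items labeled with
    confidence below `threshold` are routed to `human_label` instead, trading
    a little annotation cost for cleaner supervised training data.
    """
    dataset = []
    for text in items:
        label, conf = llm_label(text)
        if conf < threshold:
            label = human_label(text)
        dataset.append((text, label))
    return dataset
```

The resulting dataset then trains a small, specialized edge classifier, avoiding the cost, network, and security constraints of calling the LLM at inference time.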


Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference

Ye, Shengyuan, Du, Jiangsu, Zeng, Liekang, Ou, Wenzhong, Chu, Xiaowen, Lu, Yutong, Chen, Xu

arXiv.org Artificial Intelligence

Transformer-based models have unlocked a plethora of powerful intelligent applications at the edge, such as voice assistants in smart homes. Traditional deployment approaches offload the inference workloads to the remote cloud server, which would induce substantial pressure on the backbone network as well as raise users' privacy concerns. To address this, in-situ inference has recently been recognized as a path to edge intelligence, but it still confronts significant challenges stemming from the conflict between intensive workloads and limited on-device computing resources. In this paper, we leverage our observation that many edge environments usually comprise a rich set of accompanying trusted edge devices with idle resources and propose Galaxy, a collaborative edge AI system that breaks the resource walls across heterogeneous edge devices for efficient Transformer inference acceleration. Galaxy introduces a novel hybrid model parallelism to orchestrate collaborative inference, along with a heterogeneity-aware parallelism planning for fully exploiting the resource potential. Furthermore, Galaxy devises a tile-based fine-grained overlapping of communication and computation to mitigate the impact of tensor synchronizations on inference latency under bandwidth-constrained edge environments. Extensive evaluation based on a prototype implementation demonstrates that Galaxy remarkably outperforms state-of-the-art approaches under various edge environment setups, achieving up to 2.5x end-to-end latency reduction.
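Tile-based overlap works by splitting a tensor into tiles so the transfer of one tile can proceed while the next tile is being computed. A toy sketch that only simulates the overlap by the order of events (a real system would issue the sends asynchronously); names are illustrative, not Galaxy's API:

```python
def pipelined_tiles(tiles, compute, communicate):
    """Interleave per-tile computation with communication of the previous tile.

    Splitting a tensor into tiles lets the send of tile i-1 overlap the
    compute of tile i, hiding synchronization cost behind useful work.
    """
    events = []
    for i, tile in enumerate(tiles):
        if i > 0:
            events.append(communicate(tiles[i - 1]))  # would overlap compute of tile i
        events.append(compute(tile))
    events.append(communicate(tiles[-1]))  # flush the final tile
    return events
```

Under a bandwidth-constrained link, this fine-grained interleaving is what keeps tensor synchronization off the critical path, rather than computing the whole tensor and then paying the full transfer latency.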


ECLM: Efficient Edge-Cloud Collaborative Learning with Continuous Environment Adaptation

Zhuang, Yan, Zheng, Zhenzhe, Shao, Yunfeng, Li, Bingshuai, Wu, Fan, Chen, Guihai

arXiv.org Artificial Intelligence

Pervasive mobile AI applications primarily employ one of two learning paradigms: cloud-based learning (with powerful large models) or on-device learning (with lightweight small models). Despite their respective advantages, neither paradigm can effectively handle dynamic edge environments with frequent data distribution shifts and on-device resource fluctuations, inevitably suffering from performance degradation. In this paper, we propose ECLM, an edge-cloud collaborative learning framework for rapid model adaptation in dynamic edge environments. We first propose a novel block-level model decomposition design to decompose the original large cloud model into multiple combinable modules. By flexibly combining a subset of the modules, this design enables the derivation of compact, task-specific sub-models for heterogeneous edge devices from the large cloud model, and the seamless periodic integration of new knowledge learned on these devices back into the cloud model. As such, ECLM ensures that the cloud model always provides up-to-date sub-models for edge devices. We further propose an end-to-end learning framework that incorporates the modular model design into an efficient model adaptation pipeline, including an offline on-cloud model prototyping and training stage and an online edge-cloud collaborative adaptation stage. Extensive experiments over various datasets demonstrate that ECLM significantly improves model performance (e.g., 18.89% accuracy increase) and resource efficiency (e.g., 7.12x communication cost reduction) in adapting models to dynamic edge environments through efficient collaboration between the edge and cloud models.
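The two halves of ECLM's block-level design are deriving a sub-model from a subset of modules and folding device-side updates back into the cloud model. A minimal sketch treating the model as a name-to-block mapping; the representation and function names are illustrative assumptions:

```python
def derive_submodel(cloud_blocks, selected):
    """Compose a compact edge sub-model from a chosen subset of cloud-model blocks."""
    return {name: cloud_blocks[name] for name in selected}

def integrate_updates(cloud_blocks, edge_blocks):
    """Fold blocks updated on-device back into the cloud model (merged in place),
    so the cloud can keep serving up-to-date sub-models."""
    cloud_blocks.update(edge_blocks)
    return cloud_blocks
```

Because only the selected blocks travel in each direction, the derive/integrate cycle is what yields the communication savings the abstract reports, compared with shipping the full model.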


AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments

Wen, Hao, Li, Yuanchun, Zhang, Zunshuai, Jiang, Shiqi, Ye, Xiaozhou, Ouyang, Ye, Zhang, Ya-Qin, Liu, Yunxin

arXiv.org Artificial Intelligence

Deep learning models are increasingly deployed to edge devices for real-time applications. To ensure stable service quality across diverse edge environments, it is highly desirable to generate tailored model architectures for different conditions. However, conventional pre-deployment model generation approaches are not satisfactory due to the difficulty of handling the diversity of edge environments and the demand for edge information. In this paper, we propose to adapt the model architecture after deployment in the target environment, where the model quality can be precisely measured and private edge data can be retained. To achieve efficient and effective edge model generation, we introduce a pretraining-assisted on-cloud model elastification method and an edge-friendly on-device architecture search method. Model elastification generates a high-quality search space of model architectures with the guidance of a developer-specified oracle model. Each subnet in the space is a valid model with a different environment affinity, and each device efficiently finds and maintains the most suitable subnet based on a series of edge-tailored optimizations. Extensive experiments on various edge devices demonstrate that our approach is able to achieve significantly better accuracy-latency tradeoffs (e.g., 46.74% higher accuracy on average with a 60% latency budget) than strong baselines with minimal overhead (13 GPU hours in the cloud and 2 minutes on the edge server).
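The on-device search step reduces to choosing, among the subnets in the elastified search space, the most accurate one that fits the device's latency budget, with latency measured in the target environment itself. A hedged sketch of that selection rule (a generic illustration, not AdaptiveNet's edge-tailored optimizations):

```python
def select_subnet(subnets, measure_latency, estimate_accuracy, budget_ms):
    """Pick the most accurate subnet that meets the device's latency budget.

    `measure_latency` is an on-device measurement in the target environment,
    so the choice reflects actual conditions rather than pre-deployment
    estimates; `estimate_accuracy` scores each candidate's quality.
    """
    feasible = [s for s in subnets if measure_latency(s) <= budget_ms]
    if not feasible:
        return min(subnets, key=measure_latency)  # fall back to the fastest subnet
    return max(feasible, key=estimate_accuracy)
```

Measuring latency after deployment is the crux of the paper's argument: the same subnet can land on different sides of the budget on different devices, which pre-deployment generation cannot anticipate.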


AI/ML at the Edge: 4 things CIOs should know

#artificialintelligence

And latency almost always matters when it comes to running artificial intelligence/machine learning (AI/ML) workloads. "Great AI requires a lot of data, and it demands it immediately." That's both the blessing and the curse in any sector – industrial and manufacturing are prominent examples, but the principle applies widely across businesses – that generates tons of machine data outside of its centralized clouds or data centers and wants to feed it to an ML model or other form of automation for any number of purposes. Whether you're working with IoT data on a factory floor, or medical diagnostic data in a healthcare facility – or one of many other scenarios where AI/ML use cases are rolling out – you probably can't do so optimally if you're trying to send everything (or close to it) on a round-trip from the edge to the cloud and back again. In fact, if you're dealing with huge volumes of data, your trip might never get off the ground. "I've seen situations in manufacturing facilities ...


Lenovo Harnesses AI at the Edge with ThinkEdge SE450 Launch

#artificialintelligence

The News: Lenovo Infrastructure Solutions Group (ISG) announced the expansion of the Lenovo ThinkEdge portfolio with the introduction of the new ThinkEdge SE450 server, designed to deliver an artificial intelligence (AI) platform directly at the edge to potentially accelerate business insights. The ThinkEdge SE450 could advance intelligent edge capabilities with AI-ready technology that provides faster insights and enhanced computing performance to more environments and accelerates real-time decision making at the edge. Analyst Take: I was interested to see Lenovo bolstering its overall AI edge proposition with the launch of its new ThinkEdge SE450 server product. The new offering is purpose-built for the edge, supporting the compact form factor, power efficiency optimization, ruggedized packaging, low acoustics, wireless connectivity, and remote management capabilities required to meet challenging edge environments. These environments include retail, manufacturing, smart city, and telecom settings, where factors such as harsh conditions, extreme temperatures, and security mandates dictate the AI-enabled, GPU-based design of the ThinkEdge SE450 server.